[skyrl-train] Refactor TIS to use more comprehensive off policy correction config #849

erictang000 · 2026-01-07T00:41:24Z

Overview

Marks trainer.algorithm.use_tis and trainer.algorithm.tis_imp_ratio_cap for deprecation
Introduces new trainer.algorithm.off_policy_correction config (see new config below)
Updates loss functions to return a LossMetrics TypedDict containing loss metrics (previously returned just loss, clip_ratio)
Updates workers to all-reduce mean/max/min metrics appropriately, and to propagate loss metrics back up to the trainer.
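A minimal sketch of the shape the last two changes might take. The field names in `LossMetrics` and the suffix-based reduction rule are assumptions for illustration, not the PR's actual definitions, and a real worker would use a distributed all-reduce rather than a loop over per-worker dicts.

```python
from typing import Dict, List, TypedDict


class LossMetrics(TypedDict, total=False):
    # Hypothetical fields; the TypedDict in the PR may define different keys.
    clip_ratio: float
    is_ratio_mean: float
    is_ratio_max: float
    is_ratio_min: float


def reduce_metrics(per_worker: List[LossMetrics]) -> Dict[str, float]:
    """Combine per-worker metrics.

    Assumed convention: keys ending in "_max" or "_min" are reduced with
    max/min, and every other key is averaged across workers.
    """
    reduced: Dict[str, float] = {}
    keys = {key for metrics in per_worker for key in metrics}
    for key in sorted(keys):
        values = [metrics[key] for metrics in per_worker if key in metrics]
        if key.endswith("_max"):
            reduced[key] = max(values)
        elif key.endswith("_min"):
            reduced[key] = min(values)
        else:
            reduced[key] = sum(values) / len(values)
    return reduced
```

Averaging a max or min across workers would understate the extremes, which is presumably why the reductions are handled per metric type.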

Off Policy Correction Config

# To be deprecated in favor of off_policy_correction.tis_ratio_type = "token"
# and "token_tis_ratio_clip_high"
tis_imp_ratio_cap: -1.0
use_tis: false

off_policy_correction:
      # type of importance sampling ratio to use for ppo loss correction
      # here importance sampling ratio refers to exp(logprobs_{policy_old} - logprobs_{rollout_policy})
      tis_ratio_type: null # null, "token", "sequence"

      # used if tis_ratio_type = "token", 1.5-5.0 is recommended for "token" tis_ratio_type
      token_tis_ratio_clip_high: 2.0
      # used if tis_ratio_type = "sequence", 2.0-10.0 is recommended for "sequence" tis_ratio_type
      sequence_tis_ratio_clip_high: 5.0

      # method of masking out sequences with cumulative importance sampling ratios outside the cap
      # "product" masks out sequences with product of importance ratios outside the cap
      # "geometric" masks out sequences with geometric mean of importance ratios outside the cap
      sequence_mask_metric: null # null, "product", "geometric"

      # used if sequence_mask_metric = "geometric"
      # values around 0.99-1.01 are recommended for "geometric" sequence_mask_metric - MoE models may need larger allowed ranges due to higher mismatch
      geo_mask_high: 1.01
      geo_mask_low: 0.99

      # used if sequence_mask_metric = "product"
      # values around 0.5-2.0 are recommended for "product" sequence_mask_metric
      product_mask_high: 2.0
      product_mask_low: 0.5

      # separate from sequence_mask_metric and tis_ratio_type 
      # if any off_policy_correction is enabled, masks out sequences with any token having importance ratio
      # far outside an acceptable range (low and high thresholds)
      outlier_token_is_threshold_low: 1e-4
      outlier_token_is_threshold_high: 100
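The options above can be illustrated for a single sequence in plain Python. This is a sketch under stated assumptions, not the code in `ppo_utils.py`: the function names are hypothetical, the "sequence" ratio is assumed to be the product of the per-token ratios, and the mask bounds are assumed to be inclusive. The real implementation operates on batched tensors with loss masks.

```python
import math
from typing import List, Optional


def token_tis_weights(old_logprobs: List[float], rollout_logprobs: List[float],
                      clip_high: float = 2.0) -> List[float]:
    """Per-token ratio exp(old - rollout), truncated from above at clip_high."""
    return [min(math.exp(o - r), clip_high)
            for o, r in zip(old_logprobs, rollout_logprobs)]


def sequence_tis_weight(old_logprobs: List[float], rollout_logprobs: List[float],
                        clip_high: float = 5.0) -> float:
    """Assumed sequence ratio: product of token ratios, truncated at clip_high."""
    log_ratio_sum = sum(o - r for o, r in zip(old_logprobs, rollout_logprobs))
    return min(math.exp(log_ratio_sum), clip_high)


def keep_sequence(old_logprobs: List[float], rollout_logprobs: List[float],
                  metric: Optional[str], low: float, high: float,
                  outlier_low: float = 1e-4, outlier_high: float = 100.0) -> bool:
    """Return False if the sequence should be masked out of the loss."""
    if not old_logprobs:
        raise ValueError("expected at least one token")
    log_ratios = [o - r for o, r in zip(old_logprobs, rollout_logprobs)]
    # Outlier check: any single token far outside the range masks the sequence.
    if any(not (outlier_low <= math.exp(lr) <= outlier_high) for lr in log_ratios):
        return False
    if metric is None:
        return True
    if metric == "product":
        value = math.exp(sum(log_ratios))
    elif metric == "geometric":
        # Geometric mean of ratios = exp(mean of log ratios).
        value = math.exp(sum(log_ratios) / len(log_ratios))
    else:
        raise ValueError(f"unknown sequence_mask_metric: {metric}")
    return low <= value <= high
```

Because the geometric mean normalizes by sequence length, its recommended range (0.99-1.01) is much tighter than the product range (0.5-2.0), where per-token mismatch compounds over long sequences.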

gemini-code-assist

Code Review

This pull request refactors the Truncated Importance Sampling (TIS) configuration into a more comprehensive rollout_correction system, which is a great improvement for structure and extensibility. The new implementation adds flexible rollout correction mechanisms, including different TIS ratio types and rejection masks. The changes are well-documented and handle the deprecation of old parameters gracefully. I've identified a bug in a conditional check that could cause a crash, and an opportunity to refactor for better efficiency and code clarity. My detailed feedback is in the comments below.

skyrl-train/skyrl_train/trainer.py

skyrl-train/skyrl_train/utils/ppo_utils.py

erictang000 · 2026-01-13T00:28:48Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a significant refactoring of the off-policy correction mechanism, replacing the simple TIS flags with a more comprehensive off_policy_correction configuration. This is a great improvement for flexibility and experimentation. The changes are well-implemented across the codebase, including documentation, examples, and tests. I've identified a few critical bugs in the implementation and some areas for improvement in the examples and utility functions to enhance clarity and correctness. Please see the detailed comments below.

skyrl-train/skyrl_train/utils/utils.py

skyrl-train/skyrl_train/workers/megatron/megatron_worker.py

skyrl-train/docs/configuration/config.rst

skyrl-train/examples/flash_rl/run_dapo_gsm8k_flashrl_0.5b_fp8.sh

skyrl-train/examples/flash_rl/run_dapo_gsm8k_flashrl_0.5b_int8.sh

skyrl-train/examples/megatron/run_megatron.sh

skyrl-train/skyrl_train/config/ppo_base_config.yaml

skyrl-train/skyrl_train/distributed/strategy.py

skyrl-train/tests/cpu/algorithms/test_losses.py

erictang000 · 2026-01-13T00:52:33Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a significant and well-executed refactoring of the off-policy correction mechanism. Moving from a simple TIS flag to a comprehensive off_policy_correction configuration block is a major improvement in flexibility and power. The introduction of the LossMetrics TypedDict is a great step towards better type safety and code clarity. The changes are consistently applied throughout the codebase, including documentation, examples, and tests, which is commendable. The new tests for the off-policy correction logic are particularly thorough. I have a couple of suggestions regarding a hardcoded path in an example script and a potential bug in the distributed strategy logic, which I've detailed in the comments.

skyrl-train/skyrl_train/distributed/strategy.py

skyrl-train/examples/megatron/run_megatron_dapo_qwen3_30b_a3b_lora.sh

erictang000 added 2 commits January 7, 2026 00:40

x

f033e65

Merge branch 'main' of https://github.com/erictang000/SkyRL into rollout_correction

0b236fe

gemini-code-assist bot reviewed Jan 7, 2026

skyrl-train/skyrl_train/trainer.py (outdated, resolved)

skyrl-train/skyrl_train/utils/ppo_utils.py (resolved)

erictang000 added 12 commits January 7, 2026 00:50

x

3f3b759

x

29efd6f

x

1520157

x

45a59c2

fix tests and add rollout correction to other loss types

ce01bb2

add metrics

abac800

propagate metrics up and refactor how we do metric reductions for max and min

2dc7364

make default null and propagate megatron metrics

349369d

x:

f3f7054

Merge branch 'rollout_correction' of https://github.com/erictang000/SkyRL into rollout_correction

c45c130

big cleanup - remove clip_ratio return (fix custom algorithms stuff), unite metrics under loss_metrics, other clean up

63d38c5

x

7e83c10

erictang000 changed the title ~~[skyrl-train] Refactor TIS to use more comprehensive rollout correction config~~ [skyrl-train] Refactor TIS to use more comprehensive off policy correction config Jan 8, 2026

erictang000 added 8 commits January 8, 2026 23:17

renaming

cf042fc

x

cef7121

x

9485bdd

Merge branch 'main' of https://github.com/erictang000/SkyRL into rollout_correction

9e11eda

x

c06747c

x

0b5ebfd

x

0697957

add docs

6b9e1e4

gemini-code-assist bot reviewed Jan 13, 2026

x

46b6fe5

erictang000 requested a review from tyler-griggs January 13, 2026 00:52

x

d72d9c6

gemini-code-assist bot reviewed Jan 13, 2026

skyrl-train/skyrl_train/distributed/strategy.py (outdated, resolved)

skyrl-train/examples/megatron/run_megatron_dapo_qwen3_30b_a3b_lora.sh (outdated, resolved)

skyrl-train/examples/megatron/run_megatron_dapo_qwen3_30b_a3b_lora.sh (outdated, resolved)

erictang000 added 2 commits January 13, 2026 01:08

gemini:

db76d01

x

ac0659c
